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Abstract 

In  this  paper  we  address  the  problem  of multi-class  pattern  classification  using  adap¬ 
tive  layered  networks.  We  view  such  networks  as  performing  generalised  linear  dis- 
crin  inant  analysis  in  which  a  particular  parametric  form  is  assumed  for  the  nonlinear 
functions.  Training  the  network  consists  of  a  least-square  approach  which  combines 
a  generalised  inverse  computation  to  solve  for  the  final  layer  weights,  together  with  a 
nonlinear  optimisation  scheme  to  solve  for  parameters  of  the  nonlinearities.  Such  an 
approach  performs  feature  extraction  and  classification  simultaneously,  in  which  the 
feature  extraction  is  (optimally)  matched  to  the  classification  scheme.  We  derive  a 
general  analytic  form  for  the  feature  extraction  criterion  and  interpret  it  for  specific 
forms  of  target  coding  and  error  weighting.  A  particular  aspect  of  the  approach  is 
to  exhibit  how  a  priori  information  regarding  nonuniform  class  membership,  uneven 
distribution  between  train  and  test  sets  and  misclassification  costs  may  be  exploited 
in  a  regularised  manner  in  the  training  phase  of  networks. 
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1  Introduction 

Connectionist  models  based  on  adaptive  layered  networks  (e.g.  Multilayer  Pereeptrons  [17] 
and  Radial  Basis  Functions  [2])  have  been  used  with  some  success  when  operating  as  static 
pattern  classifiers  in  problems  as  diverse  as  sonar  classification  [9],  speech  recognition  [16] 
and  medical  diagnosis  [1],  The  ability  of  feed-forward  layered  networks  to  perform  static 
pattern  discrimination  stems  from  their  potential  to  create  a  specific  nonlinear  transforma¬ 
tion  into  a  space  spanned  by  the  outputs  of  the  hidden  units  in  which  class  separation  is 
easier  [21].  This  transformation  is  constrained  to  maximise  a  specific  feature  selection  cri¬ 
terion  [21]  which  may  be  viewed  as  a  nonlinear  multi-dimensional  generalisation  of  Fisher’s 
Linear  Discriminant  function  [5], 

The  expression  for  the  network  feature  extraction  criterion  derived  in  [21]  involves  the 
weighted  between  class  covariance  matrix  (where  the  weighting  is  determined  by  the  square 
of  the  number  of  patterns  in  each  class).  This  implies  that  adaptive  networks  trained  on  a 
1-from-n  classifier  problem  bias  strongly  in  favour  of  those  classes  which  have  the  largest 
membership  in  the  training  data.  Thus,  in  order  to  minimise  the  error  over  the  entire 
training  set,  the  optimum  solution  for  the  network  parameters  is  such  that  the  network 
misclassifies  patterns  in  classes  with  smallest  representation  in  favour  of  those  with  larger 
representation  in  the  training  set,  irrespective  of  the  frequency  of  occurrence  in  actual 
‘operation’.  Also,  training  of  networks  often  takes  no  account  of  costs  of  misclassification. 

These  are  undesirable  features  of  networks  (and  many  other  standard  classifiers)  in  prob¬ 
lems  where  information  on  one  particular  class  may  be  more  difficult  or  expensive  to  obtain 
than  other  classes,  and  where  it  is  important  to  consider  the  costs  of  misclassification.  For 
instance,  in  speech  recognition  the  bulk  of  the  continuous  acoustic  signal  consists  of  silence 
whereas  the  dominant  information  content  is  contained  in  the  subword  units  (“phonemes’). 
Another  example,  which  is  considered  in  more  depth  in  a  separate  paper  [12]  is  in  the  prob¬ 
lematic  realm  of  medical  prognosis:  given  a  feature  pattern  as  determined  from  a  set  of 
observations  on  a  patient,  what  are  the  likely  future  health  prospects  of  that  patient.  In 
this  case  the  importance  of  misclassification  could  literally  be  a  matter  of  life-and-death  if 
resources  had  to  be  limited  to  those  in  most  need  who  would  gain  maximum  benefit. 

These  are  illustrative  of  pattern  classification  problems  where  the  distribution  of  patterns 
amongst  the  different  classes  in  the  training  set  is  nonuniform  and  also  follows  a  different 
distribution  to  the  expected  occurrence  or  relative  importance  of  the  classes  in  operation. 
In  addition,  there  may  be  further  prior  knowledge  which  could  be  used  to  associate  to  each 
class  a  cost  of  misclassification  with  any  other  class. 

Heuristic  methods  may  be  developed  which  attempt  to  compensate  for  some  of  these 
effects  in  training  adaptive  networks.  For  instance,  the  classes  of  the  training  data  may  be 
sampled  according  to  a  distribution  which  reflects  the  prior  probability  distribution  and  the 
network  is  subsequently  trained  on  the  sampled  data.  Alternatively,  if  training  proceeds 
iteratively,  the  number  of  iterations  in  the  learning  cycle  of  a  network  may  be  varied  for  each 
class  which  would  have  a  similar  compensatory  effect.  Equivalently,  the  sum  square  error 
minimised  by  the  network  during  training  may  be  weighted  by  the  frequency  of  occurrence 
of  the  patterns  in  each  class. 

In  this  paper,  we  adopt  a  least-squares  approach  to  pattern  classification  using  adaptive 
feed-forward  classification  networks.  For  a  network  with  linear  output  units,  we  may  view 
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the  final  layer  as  performing  a  generalised  linear  discriminant  function  [4]  with  a  specific 
parametric  form  for  the  nonlinearities.  Optimisation  of  the  network  results  in  adaptation  of 
the  nonlinearities.  For  various  combinations  of  target  coding  and  error  weighting  schemes, 
we  derive  the  feature  extraction  criteria  which  are  being  maximised  by  the  network.  The 
following  section  reviews  the  optimality  of  the  least-squares  approach  and  Section  3  places 
the  adaptive  feed-forward  network  within  that  framework.  An  expression  for  the  network 
feature  extraction  criterion  is  derived  and  in  Section  4  its  properties  are  explored  for  various 
target  coding  and  error  weighting  schemes. 


2  Least-mean-squares  Pattern  Analysis 


The  least-mean-square-error  design  criterion  has  been  widely  used  in  pattern  recognition 
since  it  assumes  no  prior  knowledge  of  class  distributions  or  a  priori  probabilities  [3,  13,  22, 
23].  In  this  section  we  shall  review  the  properties  of  the  least-mean-square  approach  which 
allow  it  to  obtain  an  optimum  set  of  weights  for  a  generalised  linear  discriminant  function 
in  which  the  nonlinear  functions  are  assumed  known. 

Consider  a  (possibly  nonlinear)  vector  function  g(x)  =  (ffi(*),ff2(*)>- •  •  iffncC*))’  of  a 
pattern  x  =  (z1,*2, . . .  ,*„)*.  A  generalised  linear  discriminant  function  ,  O,  has  the  form 
[4,  10] 

O  =  Ao  +  £A>*(*),  (!) 

j=i 

where  Ay  are  the  weights,  Ao  is  a  bias  term  and  there  are  n0  nonlinear  functions,  g,,  of 
the  pattern  *.  A  pattern  classification  device  such  as  this,  employing  a  fixed  nonlinear 
transformation  g(x)  followed  by  a  linear  transformation,  has  also  been  termed  a  phi  machine 
[15,  24].  Several  different  forms  for  the  nonlinear  functions,  g,  have  been  considered.  For 
example,  Specht  [18]  uses  a  polynomial  discriminant  function  for  pattern  classification  in 
which  the  coefficients  are  determined  by  an  expansion  of  an  approximation  to  the  probability 
density  function. 

For  a  c-class  problem,  we  may  form  c  discriminant  functions 

po 

Oh  =  Aoi  +  Aysff,(»),  *=l,...,e  (2) 

7=1 

and  for  P  patterns  xl,x2,.. .  ,xp  we  seek  a  solution  for  the  parameters,  (A,y, i  =  0, ...  ,no, 
j  =  l,...,c},  which  minimises  the  mean-square  error 

E  =  4EllAff(*‘)  +  *o-<‘llI 

P  <=i  (3) 

=  i||AG  +  A„  -  rif 

where  G  is  a  no  x  P  matrix  whose  p-th  column  is  g{xT)\  T  is  a  c  x  P  ‘target’  matrix  with 
p-th  column  tp  =  (tj,  tp, . . .  the  target  for  the  p-th  pattern  xr;  A  is  the  c  x  no  matrix 
of  weights  (Ay,  =  A,,)  and  Ao  =  (Aoi,A02...,Aoc)'  is  the  vector  of  biases.  These  functions 
g  may  be  viewed  as  performing  feature  extraction  according  to  some  fixed  rule  and  the 
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parameters  A  and  Ao  are  chosen  in  an  optimal  way.  The  target  matrix  is  the  matrix  of 
desired  outputs  for  the  given  set  of  training  patterns. 

There  have  been  several  different  interpretations  for  the  target  matrix,  T  and  these  lead 
to  different  decision  rules  [22].  We  shall  consider  the  target  matrix  in  more  detail  in  Section 
4. 


Now,  as  the  number  of  training  samples,  P,  approaches  infinity,  and  provided  that  the 
training  samples  are  obtained  at  the  same  relative  frequency  as  the  occurrence  of  samples 
of  the  process,  then  ni/P  (where  n<  is  the  number  of  samples  in  class  i)  tends  to  pi,  the 
prior  probability  and  the  limit  of  the  error  E  is  given  by  [3,  22] 

E  -  Ex  =  £p,(||Aa(z)  +  A0-*,-||J>i  (4) 

;= i 

where  the  expectation  is  with  respect  to  the  conditional  distribution  of  X  on  class  i,  i.e.  for 
any  function  z  of  z, 

(z(z))i  =  J  z(x)p(x\i)dx. 

s,  =  (su,  «ji,  •••,««)*  denotes  the  prototype  target  vector  for  class  i  (the  columns  of  T  are 
comprised  of  the  vectors  «,). 

Thus,  Equation  (4)  gives  the  large  sample  limit  of  the  mean  error.  The  solution  for  A 
and  A0  which  minimises  £„  also  minimises  [3,  22] 

E'  =  <||Aa(*)  +  Ao  -  p(z)||J),  (5) 

where  the  expectation  is  with  respect  to  the  unconditional  distribution,  p(z),  of  z  and  p(z) 
is  defined  as 

C 

p(*)  = 

.=i  (bj 

=  Sp 

where  p  is  the  c-dimensional  vector  of  posterior  probabilities  given  the  pattern  z  and  S  is 
the  c  x  c  matrix  of  prototype  target  vectors.  Thus,  p(z)  may  be  viewed  as  a  ‘conditional 
target’  vector  :  it  is  the  expected  target  vector  given  a  pattern  z,  with  the  property  that 

C  , 

(p(*))  =  /  p(*)jH*)<**  =  (7) 

■'-<*  i=l 

the  mean  target  vector.  From  Equations  (4)  and  (5)  the  discriminant  rule  which  minimises 
Eqo  has  minimum  variance  from  the  discriminant  vector  p. 

Interpreting  the  prototype  target  matrix,  S,  as  a  set  of  cost  vectors, 

Sji  =  cost  of  assigning  to  class  j  a  pattern  which  belongs  to  class  i 


then  p(z)  is  the  conditional  risk  vector  [4],  with  i-th  component  the  conditional  risk  of 
deciding  in  favour  of  class  i.  The  Bayes  decision  rule  for  minimum  conditional  risk  is 

assign  pattern  x  to  class  i  if  p,(z)  < 
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Thus,  from  Equations  (4)  and  (5),  the  discriminant  ruie  which  minimises  E  has  minimum 
variance  from  the  optimum  Bayes  discriminant  function,  p,  as  the  number  of  training  sam¬ 
ples  approaches  infinity. 


For  a  1-from-c  target  coding  scheme,  we  take 


Sii  = 


{i 


ifi=j 

otherwise. 


This  may  be  viewed  as  a  gain  matrix  in  which  the  gain  of  assigning  to  class  j  a  pattern 
which  belongs  to  class  i  is  zero  (t  ^  j),  but  is  unity  for  correct  classification.  The  vector 
p(x)  given  by  Equation  (7)  is  equal  to  p ,  the  vector  of  posterior  probabilities  and  the  Bayes 
discriminant  rule  for  minimum  error  is  [6] 


assign  pattern  x  to  class  l  if  pi(x)  >  pj(i),  j  = 


Thus,  for  this  coding  scheme,  the  least-mean-square  solution  for  A  and  Ao  gives  a  vector 
discriminant  function  which  has  minimum  variance  from  the  vector  of  a  posteriori  prob¬ 
abilities.  This  may  also  be  achieved  by  taking  the  target  matrix  to  be  the  ‘equal-cost’ 
matrix, 


|  0  if  i  =  j 
\  1  otherwise 


or  changing  the  sign  of  the  components  5,;  to  make  5  a  matrix  of  costs  (i.e.  a  cost  of  -1 
for  making  a  correct  decision  and  zero  otherwise),  and  the  Bayes  decision  rule  for  minimum 
risk  (4,  6]. 


Therefore,  the  least-mean-square  approach  is  optimal  for  the  two  particular 
choices  of  target  matrix  in  that  the  discriminant  function  obtained  has  minimum 
variance  from  the  optimal  Bayes  discriminant  function  in  the  limit  of  an  infinite 
amount  of  data. 


We  should  make  a  note  on  the  decision  rule  here,  since  the  output  of  a  trained  pattern 
recognition  system  is  not  p(*),  but  the  least -mean-square  approximation  to  it  given  by 
A g(x)  +  A0  for  a  pattern  x.  We  would  like  a  decision  rule  which  is  consistent  with  the 
mean-square  approach,  yet  reduces  to  the  Bayes  decision  rules  for  minimum  error  and 
minimum  risk  when  the  targets  are  chosen  appropriately,  either  as  a  1-from-c  target  coding 
scheme  or  a  risk  vector  coding,  since  in  these  situations,  the  discriminant  rule  is  an  optimal 
approximation  (in  the  least-mean-square  sense)  to  the  Bayes  discriminant  vector,  p(*). 
Since  the  transformation  given  by  the  weights  and  biases,  {A},  maps  the  vectors  given 
by  the  columns  of  G  to  the  targets,  T,  in  a  least-squares  sense,  the  logical  decision  rule 
(whatever  the  interpretation  of  the  matrix  T)  is  the  nearest  neighbour  decision  :  decide  x 
belongs  to  class  i  if 

l!°  -  *11  <  II O  -  *>||  j=\ . c  (8) 

where  o  =  Kg(x)  +  A0  is  the  output  vector  of  the  system,  *,  are  the  prototype  target 
vectors  of  class  t  which  form  the  columns  of  T  and  |]*)|  is  the  Euclidean  norm.  If  there  are 
the  same  number  of  patterns  as  there  are  classes  (c  =  P)  then  this  is  the  nearest  neighbour 
decision  rule. 
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In  the  situation  where  the  prototype  target  vectors  are  given  by  a  1-from-c  coding, 
so  that  the  discriminant  rule  is  an  optimal  approximation  to  the  posterior  probabilities, 
then  the  decision  rule  above  is  equivalent  to  assigning  the  vector  z  to  class  i  if  o,  is  the 
largest  element  of  the  output  vector.  Thus  the  minimum  distance  decision  rule  for  the 
approximat ion  o  to  the  optimal  vector  is  identical  to  the  decision  rule  based  on  the  optimal 
outputs  (i.e.  to  take  the  largest  component  of  the  discriminant  vector  as  the  class). 

However,  for  an  arbitrary  loss  matrix  as  the  target  matrix,  so  that  the  discriminant  rule 
is  an  optimal  approximation  to  the  Bayes  conditional  risk  vector,  the  minimum  distance 
classifier  does  not  reduce  to  the  Bayes  decision  for  minimum  risk  (which  is  to  take  the 
smallest  component  of  the  discriminant  vector  as  the  correct  class).  Therefore,  the  most 
obvious  decision  rule  for  mean-square  training  does  not  reduce,  under  limiting  cases  of  the 
target  matrix,  to  both  the  Bayes  decision  rule  for  minimum  risk  and  the  Bayes  decision 
rule  for  minimum  error.  This  is  important  since  it  shows  that  the  most  commonly  used 
decision  rule  in  traditional  statistical  pattern  analysis,  Bayes  minimum  risk,  is  not  in  general 
a  special  case  of  the  most  natural  decision  rule  for  this  formulation,  namely  the  minimum 
distance  rule. 


Returning  to  Equation  (4)  for  the  error,  we  see  that  the  error  £»  may  be  expressed  in 
terms  of  the  error,  £',  since 


Ex  =  £p,(|[AG  +  Ao-a.l!J). 

1  =  1 

=  £ p,(!|AG  +  Ao  -#>(*)  +  p(*)  - 
1  =  1 


(9) 


which  may  be  expanded  to  give  [4) 

-  E'  1  -i  h  Ao)‘(p(z)  -  a.)),  (10) 

i— i  i=i  •= l 

The  final  term  in  the  above  expression  is  identically  equal  to  zero  by  definition  of  p.  There¬ 
fore  we  have 

=  £' +  I>(M^  -  £><I!p(*)||j>.',  (ii) 

»=1  »=1 

where,  for  the  target  matrix  taken  to  be  a  matrix  of  losses ,  the  quantity  £'=1  ps(||p(z)||  ), 
is  defined  as  the  generalised  Bayesian  distance  [3]. 


It  is  straightforward  to  show  that  for  an  arbitrary  target  matrix 

-  !><iip(*)ii,}<  >  o 

<=i  i=i 

and  so  we  have 

£„>£’>  0 


(12) 


(13) 


Therefore,  the  solution  obtained  by  minimising  £«,  gives  an  error  which  is  an  upper  bound 
for  E'. 


In  principle,  the  error  E'  may  be  made  arbitrarily  small  by  a  suitable  choice  of  the  func¬ 
tions  p,(z),  provided  that  the  rank  of  A ff(z)  is  not  less  than  the  rank  of  p.  In  the  following 
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section,  we  shall  consider  a  particular  parametric  form  for  these  nonlinear  functions,  whose 
parameters  may  be  determined  readily  using  an  optimisation  strategy.  Then,  not  only  may 
we  choose  the  values  of  the  weights  A  and  A<j  to  ensure  that  £<*,  is  a  minimum  in  the  space 
spanned  by  these  weights,  but  also  we  may  choose  the  parameters  of  the  functions  g<  to 
ensure  that  the  error  is  a  minimum  in  the  space  spanned  by  the  weights  and  the  parameters 
of  Si- 

However,  first  it  must  be  emphasised  that  in  practice  we  do  not  minimise  the  error,  E oc, 
given  by  Equation  (4),  but  rather  the  finite  sample  approximation  to  it  given  by  Equation 
(3)  and  the  solution  for  the  weights  may  be  written  down  as  follows.  Minimising  E  with 
respect  to  the  bias  vector,  Ao,  gives 


Ao  =  t  —  Aj 


where  I  is  the  mean  target  vector  given  by 


and  g  is  the  mean  vector  of  nonlinearity  outputs 


1  '  , 
» =  pX>(j 


P=1 


Using  this  solution  for  Ao,  the  error  may  be  written 

E  =  -illt-AGr2, 

where  T  and  G  are  defined  as 

T  =  T-il' 

G  =  G-g  1* 

where  1  is  a  P  x  1  vector  of  l’s. 


(14) 

(15) 

(16) 

(17) 

(18) 


The  solution  for  A  which  minimises  E  with  minimum  (Frobenius)  norm  is 

A  =  T(G)+,  (19) 

where  G+  is  the  Moore-Penrose  pseudo-inverse  of  G  [7]. 

In  the  following  section  we  exploit  this  solution  and  the  properties  of  the  pseudo-inverse 
to  show  that  minimising  the  error  is  equivalent  to  maximising  a  particular  feature  extraction 
criterion  in  the  space  spanned  by  the  parameters  of  the  functions  9,  [21,  3]. 


3  Networks  and  Nonlinear  Discriminant  Analysis 


In  this  section  we  shall  consider  a  particular  form  for  the  nonlinearities  in  the  discriminant 
function,  namely  a  layered  feed-forward  classification  network.  This  class  of  structures 
attempts  to  map  a  set  of  P,  n-dimensional  feature  patterns  onto  a  set  of  P,  n'-dimensional 
target  patterns.  The  set  of  P  feature-target  pattern  pairs  constitutes  the  training  set.  In 
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its  simplest  form,  we  consider  a  network  with  n  input  nodes  which  are  fully  connected 
to  a  hidden  layer  of  no  nodes  with  arbitrary  nonlinear  transfer  functions.  The  input  to 
each  node  may  be  taken  to  be  the  scalar  product  between  the  pattern  produced  at  the 
outputs  of  the  previous  layer  and  the  set  of  weights  attached  to  the  links  of  that  node. 
However,  other  ‘combination  rules’  may  also  be  used  [2].  These  hidden  nodes  are  fully 
connected  to  a  set  of  n'  output  nodes  and  for  purposes  of  classification,  we  consider  that 
the  number  of  output  nodes  n'  =  e ,  the  number  of  classes.  We  could  consider  arbitrary 
numbers  of  layers,  nonlinearities  and  combination  rules  (between  the  weights  in  one  layer 
and  the  output  patterns  of  the  previous  layer)  without  affecting  the  analysis.  However,  we 
restrict  the  network  to  one  with  output  units  which  have  linear  transfer  functions,  and  so 
the  transformation  from  the  patterns  created  in  the  space  of  the  hidden  units  to  the  output 
space  is  linear.  Thus,  the  output  of  the  i-th  output  node  for  input  pattern  z  is  identical 
to  the  generalised  linear  discriminant  function  Equation  (1), 

no 

Ok  =  A0t  +  51  A  jSff;(*)  (20) 

>=> 

where  \}k  are  now  viewed  as  the  connections  from  the  j— th  hidden  unit  to  the  ifc-th  output 
unit  (A0k  is  a  bias  term  for  the  i-th  output)  and  g,  is  the  nonlinear  function  of  the  input 
associated  with  the  j-th  hidden  node.  For  a  single  hidden-layer  multilayer  perceptron  with 
nonlinear  transfer  functions,  <t>,  associated  with  the  j-th  hidden  unit  (often  taken  to  be  a 
logistic  function),  and  inputs  to  each  node  given  by  a  scalar  product  of  the  input  with  a 
vector  of  weights  attached  to  the  links  of  that  node,  Equation  (20)  may  be  written  as 

no  /  n  \ 

Ok  =  A0*  +  51  A;*c,  (  )  ’  (21) 

where  z,  is  the  i-th  component  of  the  input  feature  vector  to  the  network,  z,  and  fi,,  are 
the  weights  in  the  first  layer  connecting  the  i-th  input  to  the  ;-th  hidden  node. 

Training  consists  of  adapting  the  parameters  of  the  network  (the  weights  and  biases)  by 
any  suitable  optimisation  strategy  [20]  (since  the  function  and  it’s  derivative  as  determined 
by  the  network  are  known  analytically  from  (20))  to  minimise  the  square  error  between  the 
desired  target  patterns  and  the  actual  output  patterns  of  the  network  over  the  entire  training 
set.  Generalisation  ability  depends  on  the  network  being  complex  enough  (as  determined 
by  the  number  of  hidden  units)  to  mode!  adequately  the  structure  in  the  training  data 
without  being  over-complex  which  would  allow  the  network  to  fit  the  superimposed  noise 
on  the  data. 

Training  a  network  usually  involves  the  minimisation  of  a  sum-square  error  of  the  form 

E  =  j\\T-0  DJ  (22) 

where  T  is  the  c  x  P  matrix  of  target  vectors  and  O  is  the  e  x  P  matrix  of  network  output 
vectors.  We  have  seen  in  Section  2  that  this  criterion  is  optimum  in  the  sense  that  the 
solution  obtained  for  the  final  layer  weights  (in  the  limit  of  P  —  oo)  yields  a  discriminant 
function  which  (for  a  set  of  loss  vectors  as  target  vectors)  has  minimum  variance  from  the 
optimum  Bayes  solution  for  minimum  risk.  However,  in  the  feed-forward  network  considered 
here  we  are  not  simply  optimising  the  final  layer  weights,  but  also  the  parameters  of  the 
nonlinearities  (the  feature  extraction  functions)  by  adjusting  the  first  layer  weights.  All  of 
these  weights  may  be  adjusted  simultaneously  by  some  suitable  optimisation  sthenic.  The 
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one  most  commonly  used  is  steepest  descents,  but  others  have  been  considered  and  their 
performance  on  a  range  of  data  sets  assessed  [19].  Am  alternative  approach  is  to  consider  the 
error  to  be  a  function  of  the  first  layer  weights  alone,  choosing  the  final  layer  weights,  {A}, 
using  Equations  (14)  and  (19).  The  optimisation  strategy  then  adjusts  the  first  layer  weights 
and  for  each  adjustment  solves  Equations  (14)  and  (19)  for  the  final  layer  weights.  This 
ensures  that  the  error  is  always  a  minimum  with  respect  to  the  final  layer  weights.  Thus, 
the  optimisation  strategy  is  to  choose  the  parameters  of  fK-  nonlinearities,  {p},  using  some 
nonlinear  optimisation  strategy  so  that  the  optimum  linear  discriminant  error,  achieved 
by  the  final  linear  transformation,  is  minimised.  This  hybrid  linear-nonlinear  optimisation 
strategy  has  been  assessed  in  {20). 

The  advantage  of  representing  the  nonlinear  functions,  gj,  as  a  transformation  performed 
by  a  feed-forward  layered  network  is  that  we  may  easily  exploit  the  functional  form  in  order 
to  minimise  the  output  error.  Of  course,  this  amounts  to  nothing  more  than  the  chain  rule 
for  differentiation  and  a  gradient  rule  for  function  minimisation  and  it  is  in  ‘operation’, 
when  a  network  is  required  to  perform  real-time  classification  or  control,  that  network  ar¬ 
chitectures  will  give  most  gain.  However,  p^ameterisation  of  the  nonlinear  functions,  by 
whatever  means,  leads  to  increased  complexity  of  the  pattern  recognition  algorithm.  Al¬ 
though  this  may  give  an  improved  performance  on  a  training  set,  performance  on  an  unseen 
test  set  (generalisation)  may  be  poor  due  to  possible  overtraining.  In  pattern  recognition, 
there  has  been  considerable  research  devoted  to  the  trade-off  between  dimensionality,  sam¬ 
ple  size  and  algorithm  complexity  [1 1].  For  example,  one  might  naively  expect  that  as 
the  dimension  of  a  measurement  vector  is  increased,  the  classification  performance  should 
also  increase.  However,  in  practice,  the  performance  of  a  classifier  is  seen  to  improve  up 
to  a  point  and  then  decrease.  This  is  also  expected  to  apply  to  network-based  classifiers 
for  which  the  number  of  adjustable  parameters  increases  as  the  input  dimension  increases. 
Therefore,  as  with  any  problem  in  estimation,  a  natural  requirement  is  that  the  number  of 
samples  available  significantly  exceed*  the  number  of  model  parameters.  For  a  feed-forward 
network,  a  relationship  between  the  number  of  training  nodes,  n©,  the  number  of  training 
patterns,  P ,  and  the  maximum  number  of  separable  regions  has  been  given  in  [14],  but  rules 
relating  the  dimension,  sample  set  size  and  complexity  to  classifier  performance  are,  so  far, 
elusive. 

The  error  equation  (22)  may  be  written  in  a  more  general  form  if  we  consider  the 
optimum  effects  of  minimising  a  weighted  error  function  where  the  p-th  training  pattern 
may  be  weighted  by  the  real  factor 

E  =  'II*. 

p=l 

=  i||(T-0)D||J  <23) 

=  i||(T-  \H)D\\2 

where  o?  is  the  p-th  output  vector  and  p-th  column  of  the  c  x  P  matrix  O;  the  matrix  D 
is  a  P  x  P  diagonal  matrix  whose  elements  are  y/^\  and  H  is  the  no  x  P  matrix  of  hidden 
unit  outputs. 

In  the  following  section  we  shall  examine  various  combinations  of  target  coding  method¬ 
ology  and  suitable  choices  for  the  error  weighting  factors  dp,  which  allow  the  performance 
of  a  network  in  operation  to  be  tailored  by  prior  knowledge. 
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Because  the  output  nodes  have  linear  transfer  functions,  the  linear  transformation  per¬ 
formed  by  the  final  layer  may  be  inverted  by  pseudo-inverse  methods  to  reveal  the  optimum 
distribution  of  patterns  in  the  no-dimensional  space  at  the  output  of  the  hidden  units  which 
enables  the  minimum  error  to  be  achieved.  It  is  found  (see  Appendix  A  for  details)  that 
the  choice  of  weights  in  the  initial  layer  projects  the  training  patterns  by  a  nonlinear  trans¬ 
formation  into  a  distribution  at  the  output  of  the  hidden  layer  so  as  to  maximise  a  feature 
extraction  criterion  given  by 

J  =  Tr{sBS+}  (24) 

where  Tr  denotes  the  trace  of  a  matrix,  Sj  is  the  pseudo-inverse  of  Sj-  and  the  matrices 
S £ ,  St  are  defined  as 

ST  =  jHD'B"  (25) 

and 

SB  =  ^5 HD1t'tD2H‘  (26) 

where  H ,  T  are  defined  in  Appendix  A  to  be  the  mean-shifted  set  of  patterns  at  the 
output  of  the  hidden  layer  and  the  mean-shifted  set  of  target  patterns  on  the  training 
set  (where  the  mean  is  weighted  by  the  prior  expectations).  The  matrices  St  and  SB 
have  the  interpretation  of  being  the  total  and  between  class  covariance  matrices  of  the 
output  patterns  of  the  hidden  units.  Precise  interpretations  depend  on  specific  target  coding 
schemes  and  weighting  factors  which  will  be  considered  in  the  following  section.  Thus  the 
optimum  method  of  solution  of  such  a  network  is  to  find  a  nonlinear  transformation  into  the 
space  spanned  by  the  hidden  units  such  that  the  patterns  in  different  classes  are  somehow 
maximally  separated  (this  information  is  contained  in  the  between  class  covariance  matrix), 
while  still  maintaining  an  overall  total  normalisation  (through  the  total  covariance  matrix). 
In  this  sense,  feed-forward  layered  networks  operating  as  classifiers  succeed  because  they 
perform  a  specific  nonlinear  discriminant  analysis  by  exploiting  subspace  methods. 

The  criterion  (24)  is  independent  of  the  transformation  from  *he  data  space  to  the  space 
of  hidden  unit  outputs;  i.e.  it  is  not  dependent  on  the  layered  structure  described  above, 
but  it  is  a  property  of  the  least-mean-square  solution  for  the  final  layer.  A  similar  expres¬ 
sion  has  been  described  by  Devijver  [3]  who  points  out  that  one  of  its  main  advantages  is 
that  it  is  a  feature  extraction  criterion  which  takes  into  account  the  costs  of  misclassifica- 
tion.  The  expression  is  also  related  to  optimisation  criteria  used  in  clustering  [10].  What 
we  have  shown  here  is  that  the  first  part  (up  to  the  outputs  of  the  final  layer  of  hidden 
units)  of  a  feed-forward  network  with  linear  transfer  functions  at  the  outputs  performs 
feature  extraction  which  maximises  the  criterion  (24),  and  the  subsequent  final  layer  per¬ 
forms  an  optimum  mapping  on  to  the  targets.  Thus,  optimising  a  feed-forward  network 
using  a  least-mean-square  approach  optimises  a  specific  feature  extraction  and  performs 
classification  simultaneously,  whereas  conventionally  these  operations  have  been  addressed 
separately.  Performing  feature  extraction  matched  to  discrimination  is  one  reason  why 
adaptive  networks  have  been  demonstrated  to  produce  good  classification  performance  over 
a  range  of  problems. 

An  advantage  of  solving  for  the  final  layer  weights  using  a  pseudo-inverse  approach 
(such  that  the  final  layer  weights  have  minimum  norm)  is  that  the  output  values  for  any 
subsequent  input  pattern  satisfy  a  particular  linear  constraint  which  depends  on  the  target 
coding  (see  fpp  -i.dix  B).  In  particular,  for  a  1-from-c  target  coding  scheme,  the  outputs 
sum  to  unity  for  an y  given  input  pattern.  For  this  target  coding  scheme,  not  only  does 
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the  final  transformation  approximate  the  posterior  probabilities  in  a  least-squares  sense, 
but  also  gives  outputs  with  the  additional  property  that  they  are  guaranteed  to  sum  to 
unity.  However,  it  may  not  be  possible,  necessarily,  to  interpret  the  outputs  individually 
as  probabilities,  since  the  additional  requirement  for  classical  probabilities  that  the  outputs 
lie  between  zero  and  unity  may  not  be  satisfied. 

In  this  section  we  have  described  one  scheme  (a  feed-forward  layered  network)  for  param- 
eterising  the  nonlinearities  of  a  generalised  discriminant  function.  This  allows  optimisation 
methods  to  be  used  in  order  for  the  subsequent  linear  transformation  to  be  a  better  ap¬ 
proximation  (in  the  case  of  1-from-o  coding  for  example)  to  the  posterior  probabilities.  In 
addition,  we  have  shown  that  minimising  the  error  maximises  a  particular  feature  extrac¬ 
tion  criterion  at  the  outputs  of  the  hidden  units,  and  the  pseudo-inverse  approach  naturally 
leads  to  outputs  which  satisfy  a  linear  constraint.  More  importantly,  for  l-£rom-c  coding, 
the  outputs  sum  to  unity. 

In  the  next  section,  we  shall  consider  various  combinations  of  target  coding  and  error 
weighting.  For  each  combination,  we  shall  evaluate  and  interpret  the  feature  extraction  cost 
function,  J ;  we  shall  derive  the  ‘optimal’  solution  in  the  limit  of  infinite  samples,  p;  we  shall 
evaluate  the  constraint  satisfied  by  the  outputs  and  derive  an  appropriate  decision  rule. 


4  Particular  Pattern  Weighting  and  Coding  Schemes 

4.1  Example  1 


Choose  uniform  weighting  for  each  pattern  in  the  training  set  (dp  =  l,p  =  1 ,...,/’).  In 
this  case,  the  matrix  Sj  may  be  expanded  out  explicitly  as 

=  (27) 

p=  i 

where  <j>p  is  the  hidden  layer  output  pattern  vector  for  pattern  p,  and  mH  is  the  mean 
pattern  vector  (defined  over  the  training  set)  at  the  output  of  the  hidden  units  whose  j-th 
component  may  be  expressed  as  Equation  (27)  is  the  traditional  total 

covariance  matrix  of  the  hidden  unit  output  patterns  evsduated  over  the  entire  training  set. 


For  a  1-from-c  target  coding  scheme,  the  matrix  Sb  may  be  expanded  explicitly 

*=1  v  ' 


(28) 


where  ns  is  the  number  of  patterns  in  class  i,  and  mj^  is  the  mean  pattern  vector  over  all 
hidden  unit  outputs  which  belong  to  class  k  in  the  training  set, 

n* 


Equation  (28)  is  recognised  as  the  weighted  (weighted  by  the  square  of  the  numbers  of 
patterns  in  each  class)  between  class  covariance  matrix.  Thus,  for  this  particular  target 
coding  scheme,  the  classes  with  largest  membership  dominate  the  transformation. 
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The  optimal  discriminant  vector  (in  the  sense  that  choosing  the  largest  component  as 
indicative  of  the  correct  class  gives  a  maximum  gain  decision),  in  the  limit  of  infinite  samples, 
for  a  given  input  pattern,  x,  is  given  by  Equation  (6)  and  for  the  1-from-c  coding  it  has 
components 

Pi  =  ?(«!*),  (29) 

the  posterior  probabilities. 

Also,  for  this  target  coding  scheme,  the  linear  constraint  satisfied  by  the  output,  o  = 
(oj, c>2, . . .  ,0c)*,  of  the  network,  for  an  arbitrary  input  pattern,  is 

e 

£°i  =  I-  (30) 

•=i 

This  is  a  property  required  by  a  network  which  encodes  probabilities  as  outputs,  though 
the  additional  property  0  <  o,  <  1,  t  =  1,. ..  ,c  cannot  be  guaranteed. 

The  minimum  distance  decision  rule  may  be  written  as 

assign  x  to  class  i  if  Oi  >  Oj,  j  =  1, ...  ,c, 

in  other  words,  select  the  largest  output  as  the  correct  class.  This  is  also  the  decision  rule 
which  would  apply  if  the  optimal  outputs  p(t|x)  were  attained  by  the  network. 


4.2  Example  2 

Choose  uniform  weighting  for  each  pattern  in  the  training  set  and  employ  a  target  :oding 
scheme  which  is  reciprocally  weighted  by  the  numbers  in  the  class.  Specifically,  consider 
the  target  value  of  class  k  for  pattern  p 

tv  =  (  (P/*k)>  if  £  k 

*  \  0  otherwise 

This  form  of  output  coding  means  that  the  gain  in  achieving  a  correct  classification  is 
inversely  proportional  to  the  square  root  of  the  proportions  of  samples  in  each  class  in  the 
training  set.  The  matrix  Sj-  is  again  the  covariance  matrix  of  hidden  unit  outputs,  but  in 
this  case,  the  matrix  Sb  expands  to 

s*  =  ±£K-">H)K -">")*  (»> 

*=i 

which  is  the  conventional  between  class  covariance  matrix.  Thus,  employing  the  above 
target  coding  scheme  produces  a  feature  extraction  criterion  which  prevents  a  network  from 
over-compensating  for  uneven  class  distributions. 

The  components  of  the  optimal  discriminant  vector  (again,  in  the  sense  that  choosing 
the  largest  component  as  indicative  of  the  correct  class  gives  a  maximum  gain  decision),  are 
given  by 
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and  the  constraint  satisfied  by  the  outputs  of  the  network  is 


(33) 


The  minimum  distance  decision  rule  is  to  assign  z  to  class  i  if 


j  =  X,...,c, 


If  the  network  were  to  achieve  the  optimal  output,  so  that  o;  =  p,,  then  by  Equation  (32) 
tJiiJPo,  would  be  equal  to  the  posterior  probabilities  and  the  above  decision  rule  is  to 
assign  se  to  class  i  if 

—  (l-2p(«|z))  <  —  (l-2p(j|*)>  i  =  1, . . . , c. 

n,  rij 


This  is  one  of  the  important  differences  between  a  1-from-c  coding  scheme  find  a  )- 
l.om-c  coding  scheme.  In  the  former  case,  for  a  network  achieving  the  optimal  Bayes 
discriminant  vector,  the  optimal  ‘maximum  gain’  decision  is  the  same  as  the  minimum 
distance  decision.  In  the  latter  situation,  if  the  network  achieves  an  output  vector  which  is 
equal  to  the  Bayes  vector  for  maximum  gain,  then  the  minimum  distance  decision  does  not 
give  a  Bayes  maximum  gain. 


4.3  Example  3 

Choose  a  uniform  weighting  for  each  pattern  and  employ  a  target  coding  scheme  which,  for 
a  pattern  in  class  i,  takes  as  the  target  vector  the  vector  a*,  whose  j-th  component  is  the 
cost  of  assigning  to  class  j  a  pattern  which  belongs  to  class  i.  In  this  case,  the  matrix  Sg 
is  given  by 

i(x;  (t  £%(«?-«">)’.  <»«> 

where  S  is  the  prototype  target  matrix  with  t'-th  column  a,. 

The  optimal  discriminant  vector  is 


p(*)  = 

•=i 


which  is  the  Bayes  conditional  risk  vector. 

The  minimum  distance  decision  rule  is  to  assign  z  to  class  i  if 

a*  a,  -  2o*a,  <  a*ay  -  2o'a;-  j  =  1, . . . ,  c 

Generally,  this  is  not  the  same  as  the  Bayes  minimum  risk  decision  rule. 
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4.4  Example  4 


Weight  each  pattern  in  the  training  set  according  to  the  a  priori  class  probabilities  and  the 
number  in  that  class  according  to 


d,= 


Pk 

n*/P 


for  pattern  p  in  class  k 


(35) 


where  P*  is  the  class  probability  (derived  from  prior  knowledge  regarding  the  relative  ex¬ 
pected  class  importance,  or  frequency  of  occurrence  in  operation)  and  ns  is  the  number 
of  training  patterns  in  class  k  in  the  training  set.  Using  this  error  weight  Lr^  the  total 
covariance  matrix  may  be  expressed  as 


*=i  nfc 


(36) 


This  is  the  sample-based  estimate  of  the  mixture  covariance  matrix.  In  this  expression,  the 
vector  mH  is  the  sample  based  estimate  of  the  population  mean  which  may  be  expressed 
as 


where 


*=1  ‘ 

*=1 


mf  i  ± 


(37) 


(38) 


is  the  estimate  of  the  mean  of  class  k. 


If  a  1-from-c  target  coding  scheme  is  employed  (as  in  Example  1  above)  along  with  the 
above  pattern  weighting  scheme,  the  between  class  covariance  matrix  is  modified  to 

Sb  =  Z  Pi  (mf  -  (mf  -  mw)  (39) 

h=l 

where  mj f  and  mR  are  given  in  (37)  and  (38).  This  may  be  interpreted  as  a  sample- 
based  estimate  of  the  weighted  between  class  covariance  matrix.  Similarly,  choosing  to 
weight  the  targets  according  to  a  l/y/Ti  -  from  -  n'  coding  along  with  the  above  pattern 
weighting,  leads  to  a  sample  based  estimate  of  the  conventional  between  class  covariance 
matrix  (Equation  (31)  above,  but  with  n*/P  replaced  by  the  prior  probabilities  P*  and 
taking  mjf  and  mH  as  defined  by  Equations  (37)  and  (38)). 

Note  that  the  limit  of  the  prior  probabilities  being  proportional  to  the  numbers  in  each 
class  reproduces  the  ‘standard’  model.  This  ansdysis  indicates  how  a  combination  of  target 
weighting  and  error  weighting  by  the  inclusion  of  prior  probabilities  in  the  training  scheme, 
induces  a  nonlinear  transformation  which  is  capable  of  compensating  for  different  class 
importance  or  pattern  distributions  between  classes.  For  instance,  if  in  a  c-class  problem 
the  occurrence  of  each  class  in  operation  is  considered  equally  likely  but  the  number  of 
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patterns  in  each  class  in  the  training  set  is  distributed  with  in  class  k,  then  one  should 
weight  each  training  pattern  according  to 

P 

ft  _  -  for  pattern  p  in  class  k  (40) 

cxnt 

In  addition,  using  a  l/y'TJ  -  from  -  c  target  coding  scheme  ensures  that  the  nonlinear 
transformation  into  the  space  of  the  hidden  units  involves  the  conventional  between  class 
covariance  matrix. 

The  optimal  discriminant  vector  (in  the  limit  of  infinite  samples  from  the  training  set, 
sampled  at  proportions  different  from  the  test  set)  has  components 

p>  =  53*<y(‘i*)  (41) 

•=l 

where  p'(i|*)  is  the  posterior  probability  in  the  test  set  given  a  pattern  *. 

Thus,  for  a  1-from-c  training  scheme,  p,  is  the  probability  of  class  i  given  the  test  data. 
For  a  general  input,  the  outputs  still  sum  to  unity  for  this  error  weighting  and  the  minimum 
distance  decision  rule  corresponds  to  the  Bayes  decision  rule  for  minimum  expected  error. 


5  Discussion. 


In  this  paper  we  have  viewed  the  problem  of  classification  with  feed-forward  networks  as 
an  optimisation  problem  using  generalised  linear  discriminant  functions.  The  weights  of 
the  discriminant  functions  and  the  parameters  of  the  nonlinear  functions  may  be  adapted 
simultaneously  using  a  hybrid  linear-nonlinear  optimisation  strategy  to  give  a  minimum 
output  error.  Thus,  the  network  performs  adaptive  feature  extraction  matched  to  the 
optimal  least-mean-squares  classification.  We  have  shown  that  minimising  the  sum-square 
error  at  the  output  of  the  network  is  equivalent  to  maximising  a  specific  feature  extraction 
criterion,  J,  at  the  outputs  of  the  hidden  units  given  by 

J  =  Tr{SBS$} 

where  Sg  and  St  may  be  interpreted  as  the  between-class  and  total  covariance  matrices 
of  hidden  unit  outputs.  The  precise  interpretation  depends  on  the  error  weighting  and 
target  coding  scheme.  We  have  also  exploited  relationships  between  deterministic  networks 
and  statistical  decision  theory  to  show  that  the  minimum  error  in  the  infinite  sample  limit 
is  an  upper  bound  for  the  error  for  the  Bayes  discriminant  function.  Therefore  training 
a  multilayer  perceptron  maximises  a  feature  extraction  criterion  and  minimises  an  upper 
bound  for  the  Bayes  error. 

The  choice  of  error  weighting  and  target  coding  is  governed  by  prior  knowledge  of  the 
classification  task.  We  highlighted  two  important  points  about  real-world  data  and  how 
to  compensate  for  their  effects  in  network  training.  The  first  point  is  that  in  many  pat¬ 
tern  processing  tasks  the  availability  of  data  constituting  a  training  set  may  not  reflect  the 
expected  distribution  of  patterns  ‘in  operation’.  In  order  to  maximise  the  likelihood  of  per¬ 
forming  correct  classification  on  the  test  set  the  training  set  patterns  need  to  be  weighted  by 
class-conditional  probabilities.  The  effect  that  this  weighting  has  on  the  feature  extraction 
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criterion  employed  in  the  network  was  discussed  in  Section  4.4.  The  second  point  is  that, 
even  if  the  distribution  of  patterns  in  the  training  set  can  be  considered  ‘representative’, 
there  is  usually  an  anisotropic  penalty  associated  with  misclassifying  patterns.  Thus,  if  it  is 
a  more  serious  error  to  misdassify  a  pattern  in  class  (»)  into  class  (j),  rather  than  a  pattern 
from  class  (j)  into  class  (*),  then  the  associated  ‘costs’  of  these  misclassifications  may  be 
incorporated  into  a  prototype  target  matrix.  The  effect  of  folding  costs  into  the  training 
phase  itself  was  considered  in  Sections  4.2  and  4.3  and  related  to  the  minimum  Bayes  risk 
decision  in  Section  2. 

In  addition,  it  has  been  shown  that  for  an  optimisation  scheme  which  solves  for  the  final 
layer  weights  using  a  pseudo-inverse  method,  the  outputs  of  the  network  satisfy  certain 
constraints.  In  particular,  for  a  1-from-c  target  coding  scheme,  the  pseudo-inverse  approach 
for  the  final-layer  weights  ensures  that  the  outputs  sum  to  unity.  This  hybrid  linear- 
nonlinear  optimisation  method  has  been  described  elsewhere  [20]. 

The  main  advantages  of  the  least-squares  approach  is  that  it  assumes  no  prior  knowledge 
of  the  probability  distribution  of  the  patterns.  However,  it  places  emphasis  on  regions 
of  highest  sample  density,  rather  than  on  regions  near  the  decision  boundary.  Several 
approximations  have  been  proposed  in  the  literature  which  remedy  this  (see,  for  example, 
[4]).  For  example,  the  error  weighting  introduced  in  Section  3  may  be  a  function  which 
reaches  a  maximum  at  the  boundaries  of  the  decision  region.  However,  in  the  types  of 
network  considered  in  this  paper,  the  use  of  logistic  transfer  functions  for  the  hidden  nodes 
means  that  the  boundary  is  mainly  influenced  by  data  points  near  to  it.  A  further  problem 
is  that  in  the  classical  least-square  problem,  there  is  an  underlying  assumption  that  the 
errors  are  in  the  target  matrix,  T,  whereas  in  reality  there  will  be  errors  in  the  training  set 
due  to  noise  on  the  data  and  measurement  errors.  In  such  a  situation,  a  total  least-squares 
approach  should  be  considered  [8], 

In  conclusion,  this  paper  has  elucidated  why  networks  perform  well  as  static  pattern 
classification  devices  and  pointed  out  situations  where  their  performance  is  biased  by  the 
prior  distribution  of  patterns  in  the  training  set.  A  general  regularisation  scheme  which 
compensates  for  the  discussed  uneveness  has  been  proposed,  and  the  consequences  that 
the  coding  scheme  has  on  the  feature  extraction  criterion  have  been  presented  analytically. 
Detailed  numerical  simulations  confirm  the  generic  statements  made  in  this  paper  and  are 
presented  elsewhere  [12]. 
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Appendix  A  The  Generalised  Network  Feature 
Extraction  Criterion. 


Minimising  the  weighted  error,  Equation  (23),  with  respect  to  the  Lias  on  the  outputs,  Ao, 
gives  the  solution 

(42) 

where  t  and  rnH  are  given  by 


Ao  —l  —  Am®, 

TP2! 
Ep=l  dP 
HD1 1 


t  = 


(43) 


£p=i  dp 

and  H  is  the  n0  x  P  matrix  of  hidden  unit  outputs.  Substituting  for  Ao  into  the  equation 
for  the  error  gives 

E  =  ~\\(t  -  AH)D||3,  (44) 

with  the  definitions 


and 


T  =  T  -IV 


H  5  H  -  m"l* 


(45) 

(46) 


The  matrix  A  which  minimises  the  error,  E,  with  minimum  norm  is  given  by 

A  =  TD(HD )  +  ,  (47) 

where  (H D)+  denotes  the  Moore-Penrose  pseudo-inverse  of  the  matrix  HD.  Substituting 
for  A  into  Equation  (44)  and  using  the  properties  of  the  pseudo-inverse,  the  error  may  be 
written  [20] 

E  =  jTr{TD^T' -tD7H'(HD7H')+HDit’},  (48) 

where  Tr  is  the  matrix  trace  operation. 

Thus,  since  the  targets,  T,  and  the  weights,  D ,  are  fixed,  minimising  the  error,  E,  is 
equivalent  to  maximising  the  discriminant  function 

J  =  i Tr{TD1H\HD1H,)*HDit *},  (49) 

at  the  outputs  of  the  hidden  units.  Equation  (49)  may  be  rearranged  to  give  [20] 


J  =  Tt{SbS+ }, 

(50) 

where 

ST  =  pBD'H', 

(51) 

and 

sB  =  -^hd't'td'h'. 

C52) 

This  generalises  the  result  in  [20]  where  the  matrix  D  was  taken  to  be  the  identitv 
matrix.  The  result  states  that  minimising  a  weighted  sum-squared  eiTor  at  the  outputs  of 
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a  feed-forward  network  with  linear  output  units,  and  in  which  the  final  layer  weights  are 
chosen  by  a  minimum  norm  solution,  maximises  a  feature  extraction  criterion  (given  by 
Equation  (50))  at  the  outputs  of  the  hidden  units.  The  feature  extraction  criterion  is  the 
trace  of  a  product  of  a  matrix  Sg  with  the  pseudo-inverse  of  a  matrix  Sr. 
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Appendix  B  Sum  Rules 


Theorem. 

Consider  a  network  having  linear  output  units.  Let  the  weights  associated 
with  the  connections  to  these  units  be  determined  by  linear  minimum  norm 
least  squares  optimisation.  Then  if  there  exists  an  arbitrary  linear  constraint 
of  the  form 

u't”  =  u’i  Vp=l,2 . P 

with  t»  a  constant  vector,  then  the  general  output  o  of  the  network  satisfies: 

1 1*0  =  u’i 


Proof 

The  general  output  of  the  network  is  given  by 

o  —  Aq  +  Ah 

where  h  is  the  output  of  the  hidden  units.  Using  the  solution  for  Ao,  Equation  (42),  and 
for  A,  Equation  (47),  this  may  be  written 

o  =  i  +  TD(HD)+  (h-mw) 

Therefore 

ti*o  =  u’l  +  u’TD{HD)+  (h-mH) 


u*T  =  «iT  -  «*il* 

By  hypothesis,  uT  =  u'Zl*.  Therefore 

u’o  =  u’t  • 


Remark:  If  the  set  of  target  vectors  satisfy  several  linear  constraints  simultaneously, 
then  so  will  the  general  network  outputs. 

As  a  special  case,  consider  the  quantity  1  T  in  the  situation  when  the  sum  of  each 
column  of  T  is  a  constant  (=  t  say); 


i’t  =  it  -  r?r 

1*TD*11* 


=  IT 


=  tl’  - 1 


=  0* 


T 

P 

i*pair 

Ep=idP 


(53) 


since  l’Dl  =  £dp. 


Page  19 


Therefore,  the  sum  over  the  outputs  of  the  trained,  optimised  network  is  given  by 
Sum  over  outputs  = 

l*o  =  1*1  (54) 

=  t,  Sum  of  each  column 

This  completes  the  proof  of  the  original  observation.  In  particular,  for  a  1-from-c  coding 
scheme  where  the  components  of  the  target  vector  belong  to  {1,0},  the  sum  of  the  outputs 
of  the  network  for  any  input  sum  to  unity. 
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